Test bias has traditionally been defined in terms of
an outside criterion measure of the performance being predicted by
the test. In test construction, where criterion-related validity data
are usually not collected until after the test is completed,
assessment of bias in the absence of outside criteria has become a
vital issue. Here, an unbiased test item is defined as one in which,
for persons with the same ability in the areas being measured, the
probability of a correct response on the item is the same regardless
of the population group membership of the individual. The total score
on a test or subtest containing the item can be used to define groups
of persons having the same ability. Once the ability groups have been
defined, a modified chi square procedure is used to evaluate each
item in the test for possible bias. While hypotheses suggested by
such an evaluation should be investigated further before making
conclusive statements concerning the source of bias, results reported
in this study support the validity of the method as a procedure for
assessing bias when outside criterion measures are unavailable.
VALIDATING A PROCEDURE FOR ASSESSING BIAS IN TEST ITEMS
IN THE ABSENCE OF AN OUTSIDE CRITERION¹

Janice Scheuneman 
The Psychological Corporation 



During the past few years the problem of bias in testing has become
an increasingly important issue. In most of the research which
has been done, bias refers to the fair use of tests and has thus been
defined in terms of an outside criterion measure of the performance
being predicted by the test. Recently, however, there has been growing
interest in assessing bias when such criteria are not available. In
test construction in particular, where criterion-related validity data
are usually not collected until after the test is completed, assessment
of bias in the absence of outside criteria has become a vital issue.
If tests are to be built which may someday prove to be unbiased in use,
it is important to identify potentially biased items during the
construction process, when test content is still flexible and items may
still be modified or replaced. In addition, the identification of such
items is a first step in isolating sources of bias in the test content,
information which is potentially useful to researchers in [illegible]
interested in population group differences as well as for test
construction purposes in the future.
In the method discussed in this paper, an unbiased item is defined
as one in which, for persons with the same ability in the areas being
measured, the probability of a correct response on the item is the same
regardless of the population group membership of the individual. In
cases where no outside criterion measures of ability are available,
the total score on a test or subtest containing the item can be used
to define groups of persons having the same ability. Assuming that the
test is reasonably valid and reliable and is homogeneous with respect
to the ability being measured, the definition can be restated as
follows: An item is unbiased if, for all individuals belonging to the
same ability group as defined by the total score on the test or subtest
containing the item, the proportion of individuals getting the item
correct is the same for each population group being considered. Once
the ability groups have been defined, a modified chi square procedure
is used to evaluate each item in the test for possible bias.
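
The restated definition can be written compactly in symbols. The notation below is introduced here for illustration only and does not appear in the original paper: i indexes the population groups and j indexes the ability groups (score ranges) formed from the total score.

```latex
P(\text{correct} \mid \text{group } i,\ \text{score range } j)
  = P(\text{correct} \mid \text{score range } j)
  \quad \text{for all } i \text{ and } j .
```

Read this way, bias is a dependence of the item's success rate on group membership that remains after ability, as approximated by total score, has been held constant.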

Table 1 gives a computational example of the procedure. This
procedure differs from the conventional chi square test primarily in the
computation of the expected frequencies. The first column gives the score
ranges which were selected for this item. In general, score ranges have
been selected by dividing the distribution of correct responses
approximately at the quartiles or quintiles for the smaller-sized group.
The next set of columns gives the frequency distributions of scores both
within and across the two population groups. The next columns give the
obtained frequencies, that is, the number of children within each
population/score range group who got the item correct. The next column,
proportion correct, is computed across population groups. Within each
score range group the total number of correct responses is divided by
the total number of children scoring in that range. This proportion is
then used to obtain the expected frequency for each cell. According
to the definition, if the item is unbiased, the proportion of correct
responses should be the same regardless of population group membership.
Hence, the expected number of correct responses is obtained by
multiplying the proportion p by the number of children in the population
group who scored in that range. In this item, a comparison of the obtained
and expected number of correct responses will quickly show that Black
children are doing consistently more poorly than expected on this item.
The item would probably be considered biased, depending on the cut-off
point chosen.
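
The computation described for Table 1 can be sketched in Python as follows. This is a minimal illustration rather than the author's program: the function name, the argument layout, and the counts in the example are invented here and are not the Table 1 data. The sketch also assumes that the statistic is the sum of (obtained - expected)^2 / expected over the correct-response cells, which the text implies but does not write out.

```python
def modified_chi_square(n, obtained):
    """Sketch of the modified chi square described in the text.

    n[j][i]        -- number of examinees in score range j, population group i
    obtained[j][i] -- number of those examinees who answered the item correctly
    Returns (chi_square, degrees_of_freedom, expected).
    """
    chi_sq = 0.0
    expected = []
    for n_row, o_row in zip(n, obtained):
        # Proportion correct in this score range, computed across groups.
        p = sum(o_row) / sum(n_row)
        # Expected correct responses: the proportion times each group's count.
        e_row = [p * n_ij for n_ij in n_row]
        expected.append(e_row)
        # Assumed form of the statistic: correct-response cells only.
        chi_sq += sum((o - e) ** 2 / e for o, e in zip(o_row, e_row))
    r, k = len(n), len(n[0])
    return chi_sq, (r - 1) * (k - 1), expected


# Hypothetical counts: two score ranges (rows) by two population groups (columns).
n = [[100, 50], [80, 40]]
obtained = [[60, 15], [70, 26]]
chi_sq, df, expected = modified_chi_square(n, obtained)
print(chi_sq, df, expected)
```

In this invented example the second group falls below its expected count in both score ranges, which is the kind of consistent shortfall the text describes for the Table 1 item.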

Developed initially as part of the item analysis program for the
Metropolitan Readiness Tests, the procedure was used to screen the item
pool for items which were potentially biased (Scheuneman, 1975). As
a rapid screening device it proved quite satisfactory. The method is
computationally simple and permits easy establishment of a decision rule
for rejection of items. It is not necessary to assume that the groups
are representative of their respective populations, nor are any
normality assumptions required. Very easy items can be evaluated
without difficulty, although very difficult items present problems. Any item
where one population group produces fewer than ten correct responses
cannot be evaluated at all. Fairly large samples are required for the
method, probably about 100 per population group.
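
The limit of ten correct responses can be checked before an item is evaluated. The sketch below is an assumption-laden illustration: the function names and counts are hypothetical, and it reads the rule in two ways, once per population group over the whole item and once per cell, since the paper mentions both a per-group and a per-cell limit.

```python
MIN_CORRECT = 10  # lower limit on correct responses stated in the text


def can_be_evaluated(obtained, min_correct=MIN_CORRECT):
    """True if every population group has at least min_correct correct responses.

    obtained[j][i] -- correct responses in score range j, population group i
    """
    group_totals = [sum(column) for column in zip(*obtained)]
    return all(total >= min_correct for total in group_totals)


def sparse_cells(obtained, min_correct=MIN_CORRECT):
    """List (score_range, group) index pairs whose count falls below the limit."""
    return [
        (j, i)
        for j, row in enumerate(obtained)
        for i, count in enumerate(row)
        if count < min_correct
    ]


# Hypothetical counts: the second group has only 4 correct in the lowest range.
obtained = [[60, 4], [70, 26]]
print(can_be_evaluated(obtained), sparse_cells(obtained))
```

A cell flagged by the second function would call for widening or merging adjacent score ranges until the limit is met, at the cost of a coarser ability grouping.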

If the method presented here is a valid procedure, examination
of the items selected as biased should yield further information about
possible sources of the bias. During the 1975 standardization of the
Metropolitan Readiness Tests, a study designed to produce norms for
large cities was conducted, with 11 cities from across the country
participating (Psych. Corp., 1976). Items in Level II Form P of the test
were analyzed for bias using data collected during this program. Level II
consists of a total of 97 items from four "skill areas" (Auditory, Visual,
Language, and Quantitative), each of which is made up of two subtests.
The sample consisted of 4441 First Grade children, of whom 1653 were
identified as White, 1502 as Black, 470 as Mexican American, 161 as Puerto
Rican, and 123 as Oriental. A total of 532 children belonged to other
population groups or were unidentified and were not included in the
analysis. The items were screened by using a 5 x r chi square, where
there were five population groups and r score range groups, r ranging
from two to five. Items found to be biased were examined further using
tests with two, three, or four population groups at a time, as seemed
indicated, in order to get at the patterns of differences between the
groups.

Results 

Of the 97 items, 34 were found to be biased² using the
five population groups together. With five of these items, significance
appeared to result from the particular choice of interval; that is,
when the score intervals were changed, the results were no longer
significant. Nine items, although consistently showing bias with
different intervals and with different combinations of population
groups, revealed no clear pattern of results. Another five items
showed few significant differences when the population groups were
tested two at a time, but instead appeared to rank the groups by
performance, with significance resulting only between the extreme groups.
With the remaining 14 items clear patterns of bias were found, either
for or against one or two population groups.

The items in the Auditory area yielded some of the most easily
interpretable results and nicest examples. In the Beginning Consonants
Test, for example, Oriental children were found to have undue difficulty
discriminating between an L and an R. In the other subtest, Sound-Letter
Correspondence, children were asked to select the letter which
corresponds to the beginning sound of a word which is pictured on the
test and named by the teacher. One of these items, in which the stimulus
word was "dog," was found to be biased against both of the Spanish-speaking
groups. On investigation, it was noted that the Spanish word for dog
was "perro" and that the distractors included a p.

In the Visual skill area, eight items were found to be biased.
Although two or three of these seemed clearly biased in favor of Oriental
children, generally the patterns of differences were not clear. When
examined for content, however, five of the biased items were found to
involve artificial letters or letter-like shapes, although only eight
items involving these letters were included in the 26-item test. This
finding is at least suggestive of a possible source of bias in the test
which would warrant further investigation.

The results for the Visual tests were further complicated, however,
in that the lower range was so much restricted for the Oriental children.
No Oriental child scored low on this test, with the result that the
score range for the lowest group was unusually wide, possibly covering
as many as 14 or 15 points. (A lower limit of at least ten correct
responses per cell was observed in all cases.) With only a few children
at the top of the interval for one group versus a large number of
children across a wide range of scores in the other groups, the
assumption of equal ability within the score range is no longer very
tenable. (A similar distribution at the upper end of the scale, however,
does not appear to create problems. Within the top scoring groups, the
differences between expected and obtained frequencies are seldom very
large, even though the upper range of scores may vary widely among the
population groups.)

The Language area consists of only 18 items, of which 11 were found
to be biased, seven of them beyond the .05 level.³ Not too surprisingly,
most of these items were found to be biased against Spanish-speaking or
Oriental children, with the biased items involving more complex
grammatical structures than the unbiased items.



In the item analysis most of the biased items were in the
Quantitative area, but in constructing the final version of the
Metropolitan Readiness Tests at Level II, the quantitative items were
broken into two subtests, Quantitative Concepts and Quantitative Operations.
The Quantitative Concepts Test contains items measuring concepts such
as part-whole relations and one-to-one correspondence, some spatial
perception items, and some simple figure analogies. The Quantitative
Operations Test primarily contains fairly straightforward counting and
simple computation problems. Looking at the two subtests separately,
five items from the nine-item Quantitative Concepts Test (55% of the
items) and four from the 15-item Quantitative Operations Test (27% of
the items) were found to be biased.

When the results from the Quantitative Concepts and the School
Language tests are examined together, a pattern appears which suggests
that Black children have trouble with terms such as "fewer," "closer,"
"larger." This pattern was discernible in the item analysis data, but
not so clearly visible as in this sample, where all four items
concerning such terms were found to be biased against Black children.

In the item analysis program potentially biased items were usually
discarded, but for a number of reasons there were too few remaining
items in some areas, and five items which were apparently biased, but
otherwise satisfactory, were included in this form of the test. Of
these, three again appeared as biased, while the stem of a fourth item
had been extensively revised in an effort to make the task involved
clearer, possibly removing the source of the bias in the original item.

Summary and Conclusion 

Any method for assessing bias which uses only information
contained within the test is open to criticism concerning the validity
of the procedure. Using internal statistics alone, it is not possible
to determine if the method is in fact isolating items which are biased
or simply selecting items more or less at random. While some false
positives are to be expected, examination of the content of the items
should reveal some specific item content or pattern of content which
is interpretable in light of knowledge beyond that yielded by the test.
While hypotheses suggested by such an examination should be investigated
further before making conclusive statements concerning the source of
bias, results such as those reported in this study support the validity
of the method as a procedure for assessing bias when outside criterion
measures are unavailable.






References 



The Psychological Corporation. Norms tables for large city school systems
(MRT Research Research No. 3). New York: Author, 1976.

Scheuneman, J. A new method of assessing bias in test items. Paper
presented at the meeting of the American Educational Research
Association, Washington, D.C., April 1975.






Footnotes

1. This paper is a slightly modified version of a paper presented at
the meeting of the American Educational Research Association as part of
a symposium entitled "The Assessment of Bias in the Absence of an Outside
Criterion," San Francisco, April 1976.

2. In determining if an item was biased or unbiased, a standard chi square
table was entered with the obtained chi square value and (r-1)(k-1) degrees
of freedom, where r is the number of score groups and k is the number of
population groups. If the probability of the obtained chi square was read
to be less than .30, the item was termed biased. It should be noted
that .30 is not the probability of rejecting an unbiased item in the
hypothesis testing sense. It is an arbitrarily selected cutting point
which serves to isolate those items which are most likely to be biased
by the definition given here. This point was selected during the item
analysis program for eliminating potentially biased items with the idea that
it was better to reject an unbiased item than to retain a biased one, while
still not eliminating so many items that the item pool would become too
small. The .30 cutting point seemed to strike a good balance and was retained
for this study. Further work is still needed to determine the various
statistical properties of the test.

3. Again the .05 level refers to the cutoff points when using the chi square
tables rather than the probability of rejecting an unbiased item.
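
The cutting-point rule in footnotes 2 and 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure as originally run: the paper reads probabilities from a printed chi square table, whereas this sketch substitutes the standard finite-series form of the chi square upper tail for integer degrees of freedom. The function names are hypothetical. As the footnotes caution, the resulting value is used only as a cutting point, not as a true rejection probability.

```python
import math


def chi_square_upper_tail(x, df):
    """Upper-tail area of a chi square distribution with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail term plus a finite correction series.
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return tail + 2.0 * math.exp(-half) / math.sqrt(2.0 * math.pi) * total


def is_flagged(chi_sq, r, k, cutoff=0.30):
    """Apply the cutting point: r score groups, k population groups."""
    df = (r - 1) * (k - 1)
    return chi_square_upper_tail(chi_sq, df) < cutoff


# Hypothetical statistic from a 2 x 2 layout (one degree of freedom).
print(is_flagged(7.6875, r=2, k=2))
```

Passing cutoff=0.05 gives the stricter level referred to in footnote 3.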